hubEnsembles: Ensembling Methods in RPredictions of future outcomes are essential to planning and decision making, yet generating reliable predictions of the future is challenging. One method for overcoming this challenge is combining predictions across multiple, independent models. These combination methods (also called aggregation or ensembling) have been repeatedly shown to produce predictions that are more accurate (Clemen 1989; Timmermann 2006) and more consistent (Hibon and Evgeniou 2005) than individual models. Because of the clear performance benefits, multi-model ensembles are commonplace across fields, including weather (Alley, Emanuel, and Zhang 2019), climate (Tebaldi and Knutti 2007), and economics (Aastveit et al. 2018). More recently, multi-model ensembles have been used to improve predictions of infectious disease outbreaks (Viboud et al. 2018; Johansson et al. 2019; McGowan et al. 2019; Cramer et al. 2022).
Across this vast literature, there are many proposed methods for generating ensembles. Generally, these methods differ in at least one of two ways: (1) the function used to combine or “average” predictions, and (2) how predictions are weighted when performing the combination. No one method is universally “the best”; a simple average of predictions works surprisingly well across a range of settings (Winkler 2015), but more complex approaches have also been shown to have benefits (McAndrew et al. 2021). Here, we present the hubEnsembles package, which provides a flexible framework for generating ensemble predictions from multiple models. Complementing other software for combining predictions from multiple models (Pedregosa et al. 2011; Weiss, Raviv, and Roetzer 2019; Bosse et al. 2023; Couch and Kuhn 2023), hubEnsembles supports multiple types of predictions from point estimates to probabilistic predictions. Throughout, we will use the term “prediction” to refer to any kind of model output that may be combined including a forecast, a scenario projection, or a parameter estimate.
The hubEnsembles package is part of a larger collection of open-source software and data tools that enables collaborative modelling exercises called the “hubverse” (https://hubdocs.readthedocs.io/en/latest/index.html). The broader “hubverse” initiative is motivated by the demonstrated benefits of collaborative hubs (Reich et al. 2022), including performance improvements of multi-model ensembles and the desire for standardization across such hubs. In this paper, we focus specifically on the functionality encompassed in hubEnsembles. We provide an overview of the methods implemented, including mathematical definitions and properties (Section 2) as well as implementation details (Section 3), we give simple examples to demonstrate the functionality (Section 4) and a more complex case study (Section 5) that motivates a discussion and comparison of the various methods (Section 6).
The hubEnsembles package supports both point predictions and probabilistic predictions of different formats. A point prediction is a single estimate of a future outcome, and a probabilistic prediction gives an estimated probability distribution over future outcomes. We use \(N\) to denote the total number of individual predictions that the ensemble will combine. For example, these predictions will often be produced by different statistical or mathematical models, and \(N\) is the total number of models. Individual predictions will be indexed by the subscript \(i\). Optionally, the package allows for calculating ensembles that use a weight \(w_i\) for each prediction. Informally, predictions with a larger weight have a greater influence on the value of the ensemble prediction, though the details of this depend on the ensemble method (described more below).
For a set of point predictions, \(p_i\) each from a distinct model \(i\), the hubEnsembles package can compute an ensemble of these predictions
\[ p_E = C(p_i, w_i) \]
using any function \(C\), and model-specific weights \(w_i\). For example, an arithmetic average of predictions yields \(p_E = \sum_{i=1}^Np_iw_i\), where the weights are non-negative and sum to 1. If \(w_i = 1/N\) for all \(i\), all predictions will be equally weighted. This framework can also support more complex functions for aggregation, such as a (weighted) median or geometric mean.
For probabilistic predictions, there are two commonly used classes of methods to average or ensemble multiple predictions: quantile averaging (also called a Vincent average (Vincent 1912)) and probability averaging (also called a distributional mixture or linear opinion pool (Stone 1961)) (Lichtendahl, Grushka-Cockayne, and Winkler 2013). To define these two classes of methods, let \(F(x)\) be a cumulative density function (CDF) defined over values \(x\) of the target variable for the prediction, and \(F^{-1}(\theta)\) be the corresponding quantile function defined over quantile levels \(\theta \in [0, 1]\). Additionally, we will use \(f(x)\) to denote a probability mass function (PMF) for a prediction of a discrete variable or a discretization (such as binned values) of a continuous variable.
The quantile average is calculated as \[ F^{-1}_Q(\theta) = \sum_{i = 1}^Nw_iF^{-1}_i(\theta). \] This computes the average value of predictions across different models for each fixed quantile level \(\theta\). It is also possible to use other combination functions, such as a weighted median, to combine quantile predictions.
The probability average or linear pool is calculated by averaging probabilities across predictions for a fixed value of the target variable, \(x\). This can be expressed in terms of either predictive CDFs or PDFs as follows: \[\begin{align*} F_{LOP}(x) &= \sum_{i = 1}^Nw_iF_i(x), \\ f_{LOP}(x) &= \sum_{i = 1}^Nw_if_i(x). \end{align*}\]
The different averaging methods for probabilistic predictions yield different properties of the resulting ensemble distribution. For example, the variance of the linear pool is \(\sigma^2_{LOP} = \sum_{i=1}^Nw_i\sigma_i^2 + \sum_{i=1}^Nw_i(\mu_i-\mu_{LOP})^2\), where \(\mu_i\) is the mean and \(\sigma^2_i\) is the variance of individual prediction \(i\), and although there is no closed-form variance for the quantile average, the variance of the quantile average will always be less than or equal to that of the linear pool (Lichtendahl, Grushka-Cockayne, and Winkler 2013). The linear pool method preserves variation between individual models, whereas the quantile average cancels away this variation (Howerton et al. 2023). Both methods generate distributions with the same mean, \(\mu_Q = \mu_{LOP} = \sum_{i=1}^Nw_i\mu_i\), which is the mean of individual model means (Lichtendahl, Grushka-Cockayne, and Winkler 2013).
To understand how these methods are implemented in hubEnsembles, we first must define the conventions employed by the hubverse and its packages for representing and working with model predictions. We begin with a short overview of concepts and conventions needed to utilize the hubEnsembles package, then explain the implementation of the two ensembling functions provided by the package, simple_ensemble and linear_pool.
Model output is a central concept in the hubEnsembles package which generally refers to a specially formatted tabular representation of predictions produced by a modeling team. Each row represents a single, unique prediction with each column providing information about what is being predicted, its scope, and its value. Per hubverse convention, the columns may be broken down into task IDs, specification of the model output representation, and the model ID (Infectious Disease Modeling Hubs 2022).
At minimum the task IDs (also called task ID variables) together specify the desired outcome being predicted, but they may also include additional information, such as any conditions or assumptions that were used to generate the predictions (Infectious Disease Modeling Hubs 2022). For example, short-term forecasts of incident influenza hospitalizations in the US at different locations and amounts of time in the future (see Table 3.1 below) might represent this information using a target column with the value “wk ahead inc flu hosp”, a location column identifying the location being predicted, a reference_date column with the “starting point” of the forecasts, and a horizon column with the number of steps ahead that the forecast is predicting relative to the reference_date. All these variables make up the task ID columns (Infectious Disease Modeling Hubs 2022).
| model_id | reference_date | location | horizon | target | output_type | output_type_id | value |
|---|---|---|---|---|---|---|---|
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.010 | 161.7159 |
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.025 | 271.6114 |
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.050 | 408.8449 |
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.100 | 551.9048 |
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.150 | 642.7793 |
| CMU-TimeSeries | 2023-05-15 | US | 1 | wk ahead inc flu hosp | quantile | 0.200 | 712.5255 |
Alternatively, longer-term scenario projections for incident COVID-19 hospitalizations in the US at different locations, amounts of time in the future, and under different assumed conditions may use the following task ID columns: a target of “incident COVID-19 hospitalizations”, a location column specifying the location being predicted, an origin_date on which the projections were made, a horizon describing the number of steps ahead that the forecast is predicting relative to the origin_date, and a scenario_id denoting under what circumstances would likely result in the protected number of incident hospitalizations. Different modeling efforts may use different sets of task ID columns and values to specify their prediction goals. Additional examples of task ID variables are available on the hubverse documentation website.
The model output representation includes the predicted values along with metadata that specifies how the predictions are conveyed, and consists of three columns: (1) output_type, (2) output_type_id, and (3) value. Unlike for the task IDs, these three columns are required and their names are fixed, as can be seen in Table 3.1 (Infectious Disease Modeling Hubs 2022). The output_type defines how the prediction is represented and may be one of "mean" or "median" (point forecasts), "quantile", "cdf", "pmf" (distributional forecasts), or "sample" (although this output type is not yet supported by the hubEnsembles package). The output_type_id provides more identifying information for a prediction and is specific to the particular output_type (see below). For quantile predictions, the output_type_id is a numeric value between 0 and 1 specifying the probability level for the quantile. In the notation we defined above, the output_type_id corresponds to \(\theta\) and the value of the prediction is the quantile estimate \(F^{-1}(\theta)\). For CDF or PMF predictions, the output_type_id is the value \(x\) at which the cumulative distribution function or probability mass function for the predictive distribution should be evaluated, and the value column contains the estimate \(F(x)\) or \(f(x)\), respectively. Requirements for the values of the output_type_id and value columns associated with each valid output type are summarized on the hubverse documentation website and in the table below.
Finally, the model_id column gives a unique identifier of the model that created the predictions.
output_type |
output_type_id |
value |
|---|---|---|
mean |
NA (not used for mean predictions) | Numeric: The mean of the predictive distribution |
median |
NA (not used for median predictions) | Numeric: The median of the predictive distribution |
quantile |
Numeric between 0.0 and 1.0: A probability level | Numeric: The quantile of the predictive distribution at the probability level specified by the output_type_id |
cdf |
Numeric within the support of the outcome variable: a possible value of the target variable | Numeric between 0.0 and 1.0: The value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
pmf |
String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: The value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable |
sample |
Positive integer sample index | Numeric: A sample from the predictive distribution |
The simple_ensemble function directly computes an ensemble from component model outputs by combining them via some function (\(C\)) within each unique combination of task ID variables, output types, and output type IDs. This function can be used to summarize predictions of output types mean, median, quantile, CDF, and PMF. The mechanics of the ensemble calculations are the same for each of the output types, though the resulting statistical ensembling method differs for different output types as described below. An aggregation function \(C\) of choice may be specified by the user.
By default, simple_ensemble uses the mean for the aggregation function \(C\) and equal weights for all models. For point predictions with a mean or median output type, the resulting ensemble prediction is an equally weighted average of the individual models’ predictions. For probabilistic forecasts in a quantile format, by default simple_ensemble produces a quantile average, and for model outputs in a CDF or PMF format, by default simple_ensemble computes an equally weighted linear opinion pool.
A median ensemble may also be created by specifying “median” as the aggregation function, or a custom function may be passed to the agg_fun argument to create other ensemble types. Similarly, model weights can be specified to create a weighted ensemble. These weights are allowed to vary for different task ID values, and for quantile forecasts the weights may also be different for different quantile probability levels (corresponding to values of the output_type_id column).
The linear_pool function implements the linear opinion pool method for ensembling projections. This function can be used to combine predictions with output types mean, quantile, CDF, and PMF. Unlike simple_ensemble, this function handles its computation differently based on the output type.
For the CDF, PMF, and mean output types, the linear pool method is equivalent to calling simple_ensemble with a mean aggregation function, since simple_ensemble produces a linear pool prediction for those output types.
However, implementation of LOP is less straightforward for the quantile output type. The reason for this is that LOP averages CDF values (probability) at each value of the target variable, but the predictions are quantiles (on the scale of the target variable) at fixed probability levels. The value for these quantile predictions will generally differ between models, and as a result we are typically not provided with CDF values at the same values of \(x\) from all component predictions. The lack of alignment between CDF values for the same probability levels is illustrated in Figure 3.1 below.
Figure 3.1: (panel A) Example of quantile output type predictions. In this example, points show model output collected for seven fixed quantiles {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, and 0.99} from two distributions (\(N(100, 10)\) in purple and \(N(120, 5)\) in green, both with underlying cumulative distribution functions, CDFs, shown with light lines). The associated values for each fixed quantile do not align across distributions (vertical lines). (panel B) Quantile average ensemble, which is calculated by averaging values for each fixed quantile (represented by horizontal dashed gray lines). The distributions from panel A are re-plotted and the black line shows the resulting quantile average ensemble. Inset shows corresponding probability density functions (PDFs). (panel C) Linear pool ensemble, which is calculated by averaging cumulative probabilities (i.e., quantiles) for each fixed value (represented by vertical dashed gray lines). To calculate the linear pool when model output is not defined for the same values (i.e., for quantile output type), model output is used to estimate the full CDFs for each distribution (solid lines), which are then used to define new quantile-value pairs that are defined for the same values. Thus, the points in panel C differ from those in panels A and B. The distributions from panel A are re-plotted and the black line shows the resulting linear pool ensemble. Inset shows corresponding PDFs.
Given that LOP cannot be directly calculated from quantile predictions, we must first obtain an estimate of the CDF for each component distribution using the provided quantiles, combine the CDFs, then calculate the quantiles from the ensemble’s CDF. We perform this calculation in three main steps, assisted by the distfromq package (Ray and Gerding 2024) for the first two:
For step 1, functionality in the distfromq package uses a monotonic cubic spline for interpolation on the interior of the provided quantiles. The user may choose one of several distributions to perform extrapolation of the CDF tails. These include normal, lognormal, and cauchy distributions, with “normal” set as the default. A location-scale parameterization is used, with separate location and scale parameters chosen in the lower and upper tails so as to match the two most extreme quantiles (Ray and Gerding 2024).
In this section, we illustrate the two main functions in hubEnsembles, simple_ensemble and linear_pool, which requires the following R packages:
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(cowplot)
library(hubUtils)
library(hubEnsembles)
The example-simple-forecast-hub has been created by the Consortium of Infectious Disease Modeling Hubs as a simple example hub to demonstrate the set up and functionality for the hubverse. The hub includes both example model output data and target data. The model output data includes quantile, mean and median forecasts of future incident hospitalizations, and pmf forecasts of the probability that the change in hospitalizations will be a “large decrease”, “decrease”, “stable”, “increase”, “large increase”. Each forecast is made for X task ID variables, including the date the forecast was made (origin_date), the number of steps ahead (horizon), the location for which the forecast was made (location), and the forecast target (target).
First we load the example model output data and the target data using the connect_hub() function from hubUtils, another package developed by the Consortium of Infectious Disease Modeling Hubs.
hub_path <- system.file("example-data/example-simple-forecast-hub",
package = "hubEnsembles")
model_outputs <- hubUtils::connect_hub(hub_path) %>%
dplyr::collect()
head(model_outputs)
## # A tibble: 6 × 8
## origin_date horizon location target output_type output_type_id value model_id
## <date> <int> <chr> <chr> <chr> <dbl> <int> <chr>
## 1 2022-12-05 -6 20 inc co… quantile 0.01 22 UMass-ar
## 2 2022-12-05 -6 20 inc co… quantile 0.025 24 UMass-ar
## 3 2022-12-05 -6 20 inc co… quantile 0.05 26 UMass-ar
## 4 2022-12-05 -6 20 inc co… quantile 0.1 28 UMass-ar
## 5 2022-12-05 -6 20 inc co… quantile 0.15 30 UMass-ar
## 6 2022-12-05 -6 20 inc co… quantile 0.2 32 UMass-ar
target_data_path <- file.path(hub_path, "target-data", "covid-hospitalizations.csv")
target_data <- read.csv(target_data_path) %>%
mutate(time_idx = as.Date(time_idx))
head(target_data)
## time_idx location value target
## 1 2021-03-21 46 12 inc covid hosp
## 2 2021-03-04 45 82 inc covid hosp
## 3 2021-02-26 46 7 inc covid hosp
## 4 2021-02-20 44 21 inc covid hosp
## 5 2021-02-09 44 19 inc covid hosp
## 6 2021-01-25 25 224 inc covid hosp
simple_ensembleWe can generate an equally weighted mean ensemble for each unique combination of values for the task ID variables (here, origin_date, horizon, location, and target), the output_type and the output_type_id. Note that this means that different ensemble methods are used for different output types: for the quantile output type in our example data, the resulting ensemble is a quantile average, while for the pmf output type, the ensemble is a linear pool.
mean_ens <- hubEnsembles::simple_ensemble(model_outputs)
head(mean_ens)
## # A tibble: 6 × 8
## model_id origin_date horizon location target output_type output_type_id value
## <chr> <date> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 hub-ense… 2022-12-05 -6 20 inc c… median NA 37.3
## 2 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.01 14.7
## 3 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.025 15.7
## 4 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.05 17
## 5 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.1 18.3
## 6 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.15 21.7
We can change the function that is used to aggregate model outputs. For example, we may want to calculate a median of component model submitted values for each quantile. We do so by specifying agg_fun = median. We will also use model_id = "hub-ensemble-median to change the name of this ensemble in the resulting data frame.
median_ens <- hubEnsembles::simple_ensemble(model_outputs,
agg_fun = median,
model_id = "hub-ensemble-median")
head(median_ens)
## # A tibble: 6 × 8
## model_id origin_date horizon location target output_type output_type_id value
## <chr> <date> <int> <chr> <chr> <chr> <dbl> <int>
## 1 hub-ense… 2022-12-05 -6 20 inc c… median NA 37
## 2 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.01 22
## 3 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.025 23
## 4 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.05 25
## 5 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.1 27
## 6 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.15 28
Custom functions can also be passed into the agg_fun argument. For example, in some circumstances a geometric mean may be a more appropriate way to combine component model outputs; here we define a custom function geometric_mean() to do so. Any custom function to be used requires an argument x for the vector of numeric values to summarize, and if relevant, an argument w of numeric weights. Then, we can use this custom function to ensemble the component model outputs, again using agg_fun = geometric_mean.
geometric_mean <- function(x){
n <- length(x)
return(prod(x)^(1/n))
}
geometric_mean_ens <- hubEnsembles::simple_ensemble(model_outputs,
agg_fun = geometric_mean,
model_id = "hub-ensemble-geometric")
head(geometric_mean_ens)
## # A tibble: 6 × 8
## model_id origin_date horizon location target output_type output_type_id value
## <chr> <date> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 hub-ense… 2022-12-05 -6 20 inc c… median NA 37.3
## 2 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.01 0
## 3 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.025 0
## 4 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.05 0
## 5 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.1 0
## 6 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.15 18.0
As expected, the mean, median, and geometric mean each give us slightly different resulting ensembles. The median point estimates in Figure 4.1 demonstrate this.
plt_all_point_ens <- bind_rows(mean_ens,
median_ens,
geometric_mean_ens) %>%
filter(location == "US",
origin_date == "2022-12-12",
output_type == "median") %>%
mutate(target_date = origin_date + horizon,
ens_flag = model_id)
plt_all_point_mod <- model_outputs %>%
filter(location == "US",
origin_date == "2022-12-12",
output_type == "median") %>%
mutate(target_date = origin_date + horizon)
ggplot(data = plt_all_point_ens) +
geom_point(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_line(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_vline(aes(xintercept = as.Date("2022-12-06"))) +
geom_line(data = plt_all_point_mod,
aes(x = target_date, y = value, color = model_id), linewidth = 0.9) +
geom_line(aes(x = target_date, y = value, color = model_id), linewidth = 0.9) +
facet_grid(rows = vars(ens_flag)) +
scale_color_manual(values = c(rainbow(3), gray(seq(0.2, 0.8,length.out = 3)))) +
scale_y_continuous(labels = comma,
name = "US incident hospitalizations") +
theme_bw() +
theme(axis.title.x = element_blank(),
legend.position = "bottom",
legend.title = element_blank())
Figure 4.1: Plot of three different ensembles (arithmetic mean, geometric mean, and median) for daily COVID-19 incident hospitalizations combining three existing models: simple_hub-baseline (black), UMass-ar (dark gray), and UMass-gbq (light gray). Truth data is plotted in black with daily values shown as points.
In addition, we can weight the contributions of each model by providing a data.frame that specifies these weights. For example, if we wanted to include the baseline model in the ensemble, but give it less weight than the other forecasts, we would use the weights = model_weights argument, where model_weights is a data.frame with a model_id column containing each unique model_id and a weight column. The argument weights_col_name can optionally be used in the case where the data.frame defines model weights in a column named something other than weight.
model_weights <- data.frame(model_id = c("UMass-ar", "UMass-gbq", "simple_hub-baseline"),
weight = c(0.4, 0.4, 0.2))
weighted_mean_ens <- hubEnsembles::simple_ensemble(model_outputs,
weights = model_weights,
model_id = "hub-ensemble-weighted-mean")
head(weighted_mean_ens)
## # A tibble: 6 × 8
## model_id origin_date horizon location target output_type output_type_id value
## <chr> <date> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 hub-ense… 2022-12-05 -6 20 inc c… median NA 37.6
## 2 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.01 17.6
## 3 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.025 18.8
## 4 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.05 20.4
## 5 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.1 22
## 6 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.15 24.6
linear_poolWe can also generate a linear pool ensemble, or distributional mixture, using the linear_pool() function; this function can only be applied to predictions with an output_type of mean, quantile, cdf, or pmf. Our example hub includes median output type, so we exclude it from the calculation.
linear_pool_ens <- hubEnsembles::linear_pool(model_outputs %>%
filter(output_type != "median"),
model_id = "hub-ensemble-linear-pool")
head(linear_pool_ens)
## # A tibble: 6 × 8
## model_id origin_date horizon location target output_type output_type_id value
## <chr> <date> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.01 0
## 2 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.025 0
## 3 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.05 7.01
## 4 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.1 21.1
## 5 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.15 25.6
## 6 hub-ense… 2022-12-05 -6 20 inc c… quantile 0.2 27.6
As described above, for quantile model outputs, the linear_pool function approximates the full probability distribution for each component prediction using the value-quantile pairs provided by that model model, and then obtains quasi-random samples from that distributional estimate. The number of samples drawn from the distribution of each component model defaults to 1e4, but this can be changed using the n_samples argument.
plt_model_outputs = model_outputs %>%
filter(location == "US",
origin_date == "2022-12-12",
output_type == "quantile",
output_type_id %in% c(0.05, 0.25, 0.5, 0.75, 0.95)) %>%
mutate(target_date = origin_date + horizon,
output_type_id = paste0("Q", as.numeric(output_type_id)*100)) %>%
spread(output_type_id, value)
plt_ens_outputs <- bind_rows(
mean_ens %>%
filter(location == "US",
origin_date == "2022-12-12",
output_type == "quantile",
output_type_id %in% c(0.05, 0.25, 0.5, 0.75, 0.95)) %>%
mutate(target_date = origin_date + horizon,
output_type_id = paste0("Q", as.numeric(output_type_id)*100),
ens_fn = "simple_ensemble"),
linear_pool_ens %>%
filter(location == "US",
origin_date == "2022-12-12",
output_type == "quantile",
output_type_id %in% c(0.05, 0.25, 0.5, 0.75, 0.95)) %>%
mutate(target_date = origin_date + horizon,
output_type_id = paste0("Q", as.numeric(output_type_id)*100),
ens_fn = "linear_pool")
) %>%
spread(output_type_id, value)
lims = c(min(c(plt_model_outputs$Q5, plt_ens_outputs$Q5)),
max(c(plt_model_outputs$Q95, plt_ens_outputs$Q95)))
p1 = ggplot(data = plt_model_outputs,
aes(x = target_date)) +
geom_point(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_line(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_vline(aes(xintercept = as.Date("2022-12-06"))) +
geom_ribbon(aes(ymin = Q5, ymax = Q95), alpha = 0.2) +
geom_ribbon(aes(ymin = Q25, ymax = Q75), alpha = 0.2) +
geom_line(aes(y = Q50), size = 0.9) +
facet_wrap(vars(model_id), ncol = 1)+
scale_y_continuous(labels = comma,
limits = lims,
name = "US incident hospitalizations") +
theme_bw() +
theme(axis.title.x = element_blank(),
legend.position = "none",
legend.title = element_blank())
p2 = ggplot(data = plt_ens_outputs,
aes(x = target_date)) +
geom_point(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_line(data = target_data %>%
filter(location == "US",
time_idx >= "2022-10-01",
time_idx <= "2023-03-01"),
aes(x = time_idx, y = value)) +
geom_vline(aes(xintercept = as.Date("2022-12-06"))) +
geom_ribbon(aes(ymin = Q5, ymax = Q95, fill = model_id), alpha = 0.2) +
geom_ribbon(aes(ymin = Q25, ymax = Q75, fill = model_id), alpha = 0.2) +
geom_line(aes(y = Q50, color = model_id), size = 0.9) +
scale_y_continuous(labels = comma,
limits = lims,
name = "US incident hospitalizations") +
theme_bw() +
theme(axis.title.x = element_blank(),
legend.position = "bottom",
legend.title = element_blank())
plot_grid(p1, p2,
align = "h", axis = "b")
To demonstrate the utility of the hubEnsembles package and the differences between the two ensembling functions, we examine the case of predicting weekly influenza hospitalizations in the US.
Since 2013 the US Center for Disease Control and Prevention (CDC) has been soliciting forecasts of seasonal influenza from modeling teams through a collaborative challenge called FluSight (CDC 2023). Here we combine these forecasts using various aggregation methods to take advantage of the greater consistency (Hibon and Evgeniou 2005) and accuracy (Clemen 1989; Timmermann 2006) of ensembles over individual models. In particular, we examine four equally-weighted ensembling methods implemented through simple_ensemble() and linear_pool(): a quantile (arithmetic) mean, a quantile median, a linear pool with normal tails, and a linear pool with lognormal tails.
The component forecasts used to generate the ensembles consist of those submitted during influenza seasons 2021-2022 and 2022-2023, the only two complete seasons for which FluSight collected quantile forecasts for a target of weekly incident flu hospitalizations at the time of this writing. These predictions are not yet formatted according to hubverse standards but are easily transformed using the as_model_out_tbl() function from the hubUtils package. Below we print the first six rows of the transformed forecasts made on May 15, 2023 for California.
flu_forecasts_raw <- readr::read_rds("data/flu_forecasts_raw.rds")
forecast_data_hub <- flu_forecasts_raw |>
dplyr::rename(forecast_date=timezero, location=unit) |>
tidyr::separate(target, sep=" ", convert=TRUE, into=c("horizon", "target"), extra="merge") |>
as_model_out_tbl(
model_id_col = "model",
output_type_col = "class",
output_type_id_col = "quantile",
value_col = "value",
sep = "-",
trim_to_task_ids = FALSE,
hub_con = NULL,
task_id_cols = c("forecast_date", "location", "horizon", "target"),
remove_empty = TRUE
)
forecast_data_hub |>
dplyr::filter(location == "06") |>
head() |>
knitr::kable()
| model_id | forecast_date | season | location | horizon | target | output_type | output_type_id | value |
|---|---|---|---|---|---|---|---|---|
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.010 | 0.000000 |
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.025 | 0.000000 |
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.050 | 4.387663 |
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.100 | 10.779742 |
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.150 | 14.898658 |
| CADPH-FluCAT_Ensemble | 2023-05-15 | 2022-2023 | 06 | 1 | wk ahead inc flu hosp | quantile | 0.200 | 18.263880 |
Within the now-properly formatted model outputs, the task ID variables are horizon, location, target, and forecast date (the date on which the forecast was made). All of the forecasts have a quantile output type with 23 total unique output type IDs \[Q = \{.010, 0.025, .050, .100, \cdots, .900, .950, .990\}.\] The values of these quantile forecasts are non-negative. The resulting ensemble forecasts will have the same task ID variables and model output specifications.
Next the component model outputs are combined using the following code to generate each model, then repeated for every forecast date.
mean_ensemble <- forecast_data_hub |>
filter(model_id != "Flusight-baseline") |>
hubEnsembles::simple_ensemble(weights=NULL, agg_fun = "mean", model_id="mean-ensemble")
median_ensemble <- forecast_data_hub |>
filter(model_id != "Flusight-baseline") |>
hubEnsembles::simple_ensemble(weights=NULL, agg_fun = "median", model_id="median-ensemble")
lp_normal <- forecast_data_hub |>
filter(model_id != "Flusight-baseline") |>
hubEnsembles::linear_pool(weights=NULL, n_samples = 1e5, model_id="lp-normal", tail_dist="norm")
lp_lognormal <- forecast_data_hub |>
filter(model_id != "Flusight-baseline") |>
hubEnsembles::linear_pool(weights=NULL, n_samples = 1e5, model_id="lp-lognormal", tail_dist="lnorm")
We score the ensemble forecasts for every unique combination of task ID variables using several common metrics in forecast evaluation, including weighted interval score (WIS) (Bracher et al. 2021), mean absolute error (MAE), 50% prediction interval (PI) coverage, and 95% PI coverage. Of these metrics, WIS and PI coverage evaluate probabilistic forecasts while MAE evaluates point forecasts. In this analysis, we take the 0.5 quantile to be the point forecast.
WIS measures how consistent a set of prediction intervals is with the true value and is an alternative to common proper scoring rules like the Log Score and Continuous Ranked Probability Score, which can’t be evaluated directly for quantile forecasts (Bracher et al. 2021). Since WIS is made up of three component penalties—one for each of spread, over-prediction, and under-prediction—a lower value indicates a more accurate forecast (Bracher et al. 2021). The \((1-\alpha)*100\)% PI coverage measures the proportion of the time that PIs at that nominal level included the true value, which provides information about whether a forecast has accurately characterized the uncertainty of future observations. Achieving approximately nominal (\((1-\alpha)*100\)%) coverage indicates a well-calibrated forecast. MAE measures the average absolute error of a set of forecasts against the true value; smaller values of MAE indicate better forecast accuracy.
The results show that the quantile median ensemble had the best overall performance in terms of WIS and MAE (and the relative versions of these metrics) with above-nominal coverage rates (Table 5.1). The two linear opinion pools had very similar performance to each other. These methods had the second-best performance as measured by WIS and MAE, but they had the highest 50% and 95% coverage rates, with empirical coverage that was well above the nominal coverage rate. The quantile mean performed the worst of the ensembles with the highest MAE, which was substantially different from that of the other ensembles.
| model | wis | mae | cov50 | cov95 | rwis | rmae |
|---|---|---|---|---|---|---|
| median-ensemble | 18.158 | 27.36 | 0.597 | 0.922 | 0.794 | 0.933 |
| lp-normal | 19.745 | 27.932 | 0.709 | 0.99 | 0.863 | 0.953 |
| lp-lognormal | 19.747 | 27.933 | 0.708 | 0.99 | 0.863 | 0.953 |
| mean-ensemble | 20.18 | 29.582 | 0.595 | 0.889 | 0.882 | 1.009 |
| Flusight-baseline | 22.876 | 29.315 | 0.604 | 0.881 | 1 | 1 |
Plots of the models’ forecasts can aid our understanding about the origin of these accuracy differences. For example, the linear opinion pools, which had the highest coverage rates, consistently had some of the widest prediction intervals. The median ensemble, which had the best WIS, seems to have best-balanced interval width overall, with narrower intervals than the linear pools that achieved near-nominal coverage on average across all time points. The quantile mean’s interval widths could vary, though it usually had narrower intervals than the linear pools. However, this model’s point forecasts demonstrated a larger error margin compared to the other ensembles, especially at longer horizons. This can be seen in Figure 5.1 for the 4-week ahead forecast in California following the 2022-23 season peak on December 5, 2022. Here the quantile mean is predicting a continued increase in hospitalizations while the other models predict a flat or slightly downward trend.
Figure 5.1: One to four week ahead forecasts for select dates plotted against truth data for Vermont and California.
When examining the forecasts week-by-week, the ensemble models tend to have similar MAE values during the entire time period. This is especially prevalent at the one-week ahead horizon, with slight divergence in MAE values for certain weeks at the four week ahead horizon Figure 5.2. However, the models show greater differences for the other two metrics, particularly during times of rapid change (Figure 5.2), with the scores aligning near-perfectly with the true weekly incident hospitalizations. In fact, the linear pools have a lower WIS than the median ensemble at the one week ahead forecast horizon for over a third of forecast dates (11 weeks) during the 2022-2023 season: from October 17, 2022 to December 12, 2022; January 2, 2023; and January 9, 2023. These dates span the rapid rise and fall of incident hospitalizations due to influenza surrounding the season’s peak, with the largest differences in WIS occurring on November 28, December 5, December 12, January 2, and January 9. Additionally, the PI coverage rates for the linear pools were at least as large as the coverage rates of the other models throughout the entire period of analysis at both the 1 and 4 week ahead forecast horizons (see Figure 5.3).
Figure 5.2: Average h-week ahead WIS and MAE by forecast date for each model for all locations (except the US national location), split by season for readability. Average truth data across all locations is plotted in black.
Figure 5.3: Average h-week ahead 95% PI coverage and truth by forecast date for each model for all locations (except the US national location), split by season for readability.
From these results, we can see that different ensembling methods perform best under different circumstances, though in this analysis all of the ensemble variations outperformed the baseline model. While the quantile median had the best overall results for WIS, MAE, 50% PI coverage, and 95% PI coverage, other models may perform better from week-to-week for each metric. For example, the linear pools achieved the lowest WIS during the period of rapid change in incident flu hospitalizations surrounding the season’s peak in early December 2022.
Depending on the target being forecast or the goal of forecasting, the user may opt to choose a particular aggregation method that is most aligned with the objectives. One case may call for prioritizing above-nominal coverage rates while another may be more interested in accurate point forecasts. The simple_ensemble and linear_pool functions and the ability to specify component model weights and an aggregation function for simple_ensemble allow users to implement a variety of ensemble methods.
Ensembles of independent models are a powerful tool to generate more accurate and more reliable forecasts of future outcomes than a single model alone. Here, we have demonstrated how to utilize hubEnsembles, a simple and flexible framework to combine individual model forecasts and create ensemble predictions. When using hubEnsembles, it is important to carefully choose an ensemble method that is well suited for the situation. Although there may not be a “best” method, matching the properties of a given ensemble method with the features of the component models will likely yield best results. For example, we showed for forecasts of seasonal influenza in the US, the quantile median ensemble performed best overall, but the linear pool method had advantages during periods of rapid change, when outlying component forecasts were likely more important. Importantly, all ensemble methods outperformed the baseline model. These performance improvements from ensemble models motivate the use of a “hub-based” approach to prediction for infectious diseases and in other fields. Fitting within the larger suite of “hubverse” tools that support such efforts, the hubEnsembles package provides important software infrastructure for leveraging the power of multi-model ensembles.